All i can say is …phoooffffff … what a data set to work on …was a good learning experience though…
There were 3 data sets involved in this project, boing-boing-candy-2015.xlsx, boing-boing-candy-2016.xlsx and boing-boing-candy-2017.xlsx.
To clean it I did the following :
read_excel functionmumate functionclean_variable functionpivot_longer function.select function.names function.mutate, as.integer and ifelse statement.2016 and 2017.Here is my code:
library(tidyverse)
library(here)
library(janitor)
READING THE DATA-SET
candy_data <- read_csv(here("clean_data/candy_all_years_clean.csv"),
col_types = cols(
gender = col_character(),
country = col_character()
))
Analysis questions
1> What is the total number of candy ratings given across the three years. (number of candy ratings, not number of raters. Don’t count missing values)
candy_data %>%
filter(!is.na(rating)) %>%
summarise(total_no_of_candy_ratings = n())
## # A tibble: 1 x 1
## total_no_of_candy_ratings
## <int>
## 1 762355
2> What was the average age of people who are going out trick or treating and the average age of people 3. not going trick or treating?
candy_data %>%
group_by(trick_or_treat_yourself) %>%
summarise( average_age = mean(age,na.rm = TRUE))
## # A tibble: 3 x 2
## trick_or_treat_yourself average_age
## <chr> <dbl>
## 1 No 39.2
## 2 Yes 35.1
## 3 <NA> 42.3
3> For each of joy, despair and meh, which candy bar revived the most of these ratings?
candy_data %>%
filter(!is.na(rating))%>%
group_by(rating, candy_name) %>%
summarise( count = n()) %>%
filter(count == max(count))
## # A tibble: 3 x 3
## # Groups: rating [3]
## rating candy_name count
## <chr> <chr> <int>
## 1 DESPAIR broken_glow_stick 7905
## 2 JOY any_full_sized_candy_bar 7589
## 3 MEH lollipops 1570
4> How many people rated Starburst as despair?
candy_data %>%
filter(candy_name == "starburst", rating == "DESPAIR") %>%
summarise (total_starburst_despair = n())
## # A tibble: 1 x 1
## total_starburst_despair
## <int>
## 1 1990
For the next three questions, count despair as -1, joy as +1 and meh as 0.
5> What was the most popular candy bar by this rating system for each gender in the dataset? # First lets create a seperate column called emotion_numeric using the recode function
candy_data_numeric <- candy_data %>%
mutate(rating_numeric = recode(rating, DESPAIR = -1, JOY = 1, MEH = 0))
candy_data_numeric %>%
group_by(gender, candy_name)%>%
summarise(sum = sum(rating_numeric, na.rm = TRUE)) %>%
filter(sum == max(sum))
## # A tibble: 5 x 3
## # Groups: gender [5]
## gender candy_name sum
## <chr> <chr> <dbl>
## 1 Female any_full_sized_candy_bar 875
## 2 I'd rather not say any_full_sized_candy_bar 68
## 3 Male any_full_sized_candy_bar 1584
## 4 Other any_full_sized_candy_bar 37
## 5 <NA> any_full_sized_candy_bar 4618
6> What was the most popular candy bar in each year?
candy_data_numeric %>%
group_by(year, candy_name)%>%
summarise(sum = sum(rating_numeric, na.rm = TRUE)) %>%
filter( sum == max(sum))
## # A tibble: 3 x 3
## # Groups: year [3]
## year candy_name sum
## <dbl> <chr> <dbl>
## 1 2015 any_full_sized_candy_bar 4603
## 2 2016 any_full_sized_candy_bar 1037
## 3 2017 any_full_sized_candy_bar 1542
7> What was the most popular candy bar by this rating for people in US, Canada, UK and all other countries?
candy_data_numeric%>%
mutate(country = if_else(country %in% c("united states", "canada", "united kingdom"), country, "other")) %>%
group_by(country, candy_name)%>%
summarise(sum = sum(rating_numeric, na.rm = TRUE)) %>%
filter( sum == max(sum))
## # A tibble: 5 x 3
## # Groups: country [4]
## country candy_name sum
## <chr> <chr> <dbl>
## 1 canada any_full_sized_candy_bar 254
## 2 other any_full_sized_candy_bar 5761
## 3 united kingdom lindt_truffle 15
## 4 united kingdom rolos 15
## 5 united states any_full_sized_candy_bar 1153